[DO NOT MERGE] Vinayak/moe final hashem #127
Conversation
… vinayak/moe_final
To ensure we don't run prefills repeatedly during decode, this provides a mechanism to queue up a certain number of prefills before executing them. VLLM_SCHED_PREFILL_COUNT specifies the minimum number of prefills to batch before execution. One caveat: --scheduler-delay-factor should be used to enforce a longer prefill scheduling delay; if not explicitly provided, it is set to the value of VLLM_SCHED_PREFILL_COUNT. This is needed because an uneven number of prefills can leave the queue permanently short of VLLM_SCHED_PREFILL_COUNT, causing the server to hang.
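The gating logic described above can be sketched as follows. This is an illustrative, hypothetical implementation, not vLLM's actual scheduler code: the `PrefillGate` class and its method names are assumptions, and the `max_wait_s` timeout stands in for the delay enforced via --scheduler-delay-factor.

```python
import os
import time
from collections import deque

# Hypothetical name; the real setting is read from the environment
# as described in the PR text.
PREFILL_COUNT = int(os.environ.get("VLLM_SCHED_PREFILL_COUNT", "0"))


class PrefillGate:
    """Hold incoming prefill requests and release them in batches.

    Prefills are held until either `min_batch` requests are queued or
    `max_wait_s` has elapsed since the first request was queued. The
    timeout plays the role of --scheduler-delay-factor: without it, an
    uneven number of prefills could leave the queue below the threshold
    forever and hang the server.
    """

    def __init__(self, min_batch: int, max_wait_s: float):
        self.min_batch = min_batch
        self.max_wait_s = max_wait_s
        self.queue: deque = deque()
        self.first_enqueue_time: float | None = None

    def add(self, request) -> None:
        # Start the delay clock when the first request arrives.
        if self.first_enqueue_time is None:
            self.first_enqueue_time = time.monotonic()
        self.queue.append(request)

    def ready(self) -> bool:
        if not self.queue:
            return False
        if len(self.queue) >= self.min_batch:
            return True
        # Timeout fallback so a short queue cannot stall indefinitely.
        return time.monotonic() - self.first_enqueue_time >= self.max_wait_s

    def drain(self) -> list:
        # Release the whole batch and reset the delay clock.
        batch = list(self.queue)
        self.queue.clear()
        self.first_enqueue_time = None
        return batch
```

A scheduler loop would call `ready()` each tick and only run the prefill step once it returns true, draining the queue into a single batch.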
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
This PR combines vinayak/moe_final with PR #126, using the kernel from hashem.
Note: to enable the Triton optimization, you need to download the Triton code:
You can use the benchmark script in
benchmarks/kernels/benchmark_mixtral_moe_dec.py
to run benchmarks.